Abstract

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which 
for practical purposes is approximated by the length of the compressed version of the file involved, using 
a real-world compression program. This practical application is called 'normalized compression distance' 
and it is trivially computable. It is a parameter-free similarity measure based on compression, and is used 
in pattern recognition, data mining, phylogeny, clustering, and classification. The complexity properties of 
its theoretical precursor, the NID, have been open. We show that the NID is neither upper semicomputable 
nor lower semicomputable up to any reasonable precision. 

Index Terms — Normalized information distance, Kolmogorov complexity, nonapproximability. 

I. Introduction 

The classical notion of Kolmogorov complexity [9] is an objective measure for the information in 
a single object, and information distance measures the information between a pair of objects [2]. This 
last notion has spawned research in the theoretical direction, among others [3], [15], [16], [17], [13], 
[14]. Research in the practical direction has focused on the normalized information distance (NID), 
also called the similarity metric, which arises by normalizing the information distance in a proper 
manner. If we also approximate the Kolmogorov complexity through real-world compressors [11], [4], 
[5], then we obtain the normalized compression distance (NCD). This is a parameter-free, feature-free, 
and alignment-free similarity measure that has had great impact in applications. The NCD was preceded 
by a related nonoptimal distance [10]. In [8] another variant of the NCD has been tested on all major 
time-sequence databases used in all major data-mining conferences against all other major methods used.
The compression method turned out to be competitive in general and superior in heterogeneous data 
clustering and anomaly detection. There have been many applications in pattern recognition, phylogeny, 
clustering, and classification, ranging from hurricane forecasting and music to genomics and analysis
of network traffic, see the many papers referencing [11], [4], [5] in Google Scholar. The NCD is trivially 
computable. In [11] it is shown that its theoretical precursor, the NID, is a metric up to negligible 
discrepancies in the metric (in)equalities and that it is always between 0 and 1.

The computability status of the NID has been open; see Remark VI.1 in [11], which asks whether the
NID is upper semicomputable, and (open) Exercise 8.4.4 (c) in the textbook [12] which asks whether 
the NID is semicomputable at all. We resolve this question by showing the following. 

Theorem 1.1: Let $x, y$ be binary strings of length $n$ and denote the NID between them by $e(x,y)$.

(i) There is no upper semicomputable function $g$ such that $|g(x,y) - e(x,y)| < (\log n)/n$ (Lemma 3.1).

(ii) There is no lower semicomputable function $g$ such that $|g(x,y) - e(x,y)| < 1/2$ (Lemma 4.3).

II. Preliminaries 

We write string to mean a finite binary string, and $\epsilon$ denotes the empty string. The length of a string
$x$ (the number of bits in it) is denoted by $|x|$. Thus, $|\epsilon| = 0$. Moreover, we identify strings with natural
numbers by associating each string with its index in the length-increasing lexicographic ordering

$$(\epsilon, 0), (0, 1), (1, 2), (00, 3), (01, 4), (10, 5), (11, 6), \ldots$$
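
As a concrete illustration of this identification, the correspondence can be sketched in Python. The function names `string_to_index` and `index_to_string` are ours and do not appear in the paper; the sketch assumes only the ordering displayed above.

```python
def string_to_index(s: str) -> int:
    """Index of the binary string s in the length-increasing lexicographic order."""
    if any(c not in "01" for c in s):
        raise ValueError("expected a binary string")
    # There are 2**len(s) - 1 strictly shorter strings; int(s, 2) ranks s
    # among the strings of its own length.
    return (1 << len(s)) - 1 + (int(s, 2) if s else 0)


def index_to_string(n: int) -> str:
    """Inverse map: write n + 1 in binary and drop the leading 1."""
    if n < 0:
        raise ValueError("expected a nonnegative integer")
    return bin(n + 1)[3:]  # bin() yields '0b1...'; strip '0b' and the leading 1
```

Both maps are computable in time linear in the length of the string, which is why the paper can treat strings and natural numbers interchangeably.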

Informally, the Kolmogorov complexity of a string is the length of the shortest string from which the 
original string can be losslessly reconstructed by an effective general-purpose computer such as a particular 
universal Turing machine U, [9]. Hence it constitutes a lower bound on how far a lossless compression 
program can compress. In this paper we require that the set of programs of U is prefix free (no program 
is a proper prefix of another program), that is, we deal with the prefix Kolmogorov complexity. (But for 
the results in this paper it does not matter whether we use the plain Kolmogorov complexity or the prefix 
Kolmogorov complexity.) We call U the reference universal Turing machine. Formally, the conditional 
prefix Kolmogorov complexity $K(x \mid y)$ is the length of the shortest input $z$ such that the reference universal
Turing machine $U$ on input $z$ with auxiliary information $y$ outputs $x$. The unconditional prefix Kolmogorov
complexity $K(x)$ is defined by $K(x \mid \epsilon)$, with $\epsilon$ the empty string. For an introduction to the definitions and notions of Kolmogorov
complexity (algorithmic information theory) see [12]. 

Let $\mathcal{N}$ and $\mathcal{R}$ denote the nonnegative integers and the real numbers, respectively. A function $f : \mathcal{N} \to \mathcal{R}$
is upper semicomputable (or $\Pi_1^0$) if it is defined by a rational-valued computable function $\phi(x, k)$,
where $x$ is a string and $k$ is a nonnegative integer, such that $\phi(x, k+1) \le \phi(x, k)$ for every $k$ and
$\lim_{k \to \infty} \phi(x, k) = f(x)$. This means that $f$ can be computably approximated from above. A function $f$
is lower semicomputable (or $\Sigma_1^0$) if $-f$ is upper semicomputable. A function is called semicomputable
(or $\Pi_1^0 \cup \Sigma_1^0$) if it is either upper semicomputable or lower semicomputable or both. A function $f$ is
computable (or recursive) iff it is both upper semicomputable and lower semicomputable (or $\Pi_1^0 \cap \Sigma_1^0$).
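
The shape of this definition, a nonincreasing sequence of rational-valued computable upper bounds, can be sketched with an ordinary compressor. This is only a toy stand-in and an assumption of ours: the "stages" are zlib compression levels, and the limit reached is the best zlib-compressed length in bits, not $K(x)$. A genuine approximation of $K$ from above would dovetail over all programs of the reference machine $U$. The name `phi` merely mirrors the symbol used above.

```python
import zlib
from fractions import Fraction


def phi(x: bytes, k: int) -> Fraction:
    """Stage-k upper bound: fewest bits over zlib levels 0..min(k, 9)."""
    if k < 0:
        raise ValueError("k must be nonnegative")
    # Taking the minimum over a growing set of levels makes phi(x, k)
    # nonincreasing in k, as the definition requires.
    best_bytes = min(len(zlib.compress(x, level)) for level in range(min(k, 9) + 1))
    return Fraction(8 * best_bytes)


if __name__ == "__main__":
    data = b"ab" * 500
    print([int(phi(data, k)) for k in range(12)])
```

The sequence stabilizes after finitely many stages here, whereas for $K$ no computable bound on the stabilization stage exists.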
Use $\langle \cdot \rangle$ as a pairing function over $\mathcal{N}$ to associate a unique natural number $\langle x, y \rangle$ with each pair $(x, y)$ of
natural numbers. An example is $\langle x, y \rangle$ defined by $y + (x + y + 1)(x + y)/2$. In this way we can extend
the above definitions to functions of two nonnegative integers, in particular to distance functions.
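
The example pairing function is the Cantor pairing. A minimal Python sketch follows; the names `pair` and `unpair` are ours, and the inverse is added only to make the uniqueness of the code visible.

```python
from math import isqrt
from typing import Tuple


def pair(x: int, y: int) -> int:
    """<x, y> = y + (x + y + 1)(x + y) / 2, as in the text."""
    return y + (x + y + 1) * (x + y) // 2


def unpair(z: int) -> Tuple[int, int]:
    """Recover (x, y) from z = <x, y>."""
    # w = x + y is the largest integer with w(w + 1)/2 <= z.
    w = (isqrt(8 * z + 1) - 1) // 2
    y = z - w * (w + 1) // 2
    return w - y, y
```

Since `unpair(pair(x, y))` returns `(x, y)`, a function of two arguments can be treated as a function of the single argument `pair(x, y)`.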

The information distance $D(x,y)$ between strings $x$ and $y$ is defined as

$$D(x,y) = \min_p \{ |p| : U(p,x) = y \wedge U(p,y) = x \},$$

where $U$ is the reference universal Turing machine above. Like the Kolmogorov complexity $K$, the
distance function $D$ is upper semicomputable. Define

$$E(x,y) = \max\{K(x \mid y), K(y \mid x)\}.$$

In [2] it is shown that the function $E$ is upper semicomputable, $D(x,y) = E(x,y) + O(E(x,y))$, the
function $E$ is a metric (more precisely, that it satisfies the metric (in)equalities up to a constant), and that
$E$ is minimal (up to a constant) among all upper semicomputable distance functions $D'$ satisfying the
mild normalization conditions $\sum_{y : y \ne x} 2^{-D'(x,y)} \le 1$ and $\sum_{x : x \ne y} 2^{-D'(x,y)} \le 1$. (The minimality property
was relaxed from metrics [2] to symmetric distances [11] to the present form [12] without serious proof
changes.) The normalized information distance $e$ is defined by

$$e(x,y) = \frac{E(x,y)}{\max\{K(x), K(y)\}}.$$

It is straightforward that $0 \le e(x,y) \le 1$ up to some minor discrepancies for all $x, y \in \{0,1\}^*$. Since
$e$ is the ratio between two upper semicomputable functions, that is, between two $\Pi_1^0$ functions, it is
a $\Delta_2^0$ function. That is, $e$ is computable relative to the halting problem $\emptyset'$. One would not expect any
better bound in the arithmetic hierarchy. Call a function $f(x,y)$ computable in the limit if there exists a
rational-valued computable function $g(x, y, t)$ such that $\lim_{t \to \infty} g(x, y, t) = f(x, y)$. This is precisely the
class of functions that are Turing-reducible to the halting set, and the NID is in this class; see Exercise 8.4.4
(b) in [12] (a result due to [7]).
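
By contrast, the NCD mentioned in the Introduction is easy to evaluate. The sketch below assumes the usual NCD formula from the literature cited as [4], [11], which is not restated in this paper: with $Z(\cdot)$ the compressed length, $\mathrm{NCD}(x,y) = (Z(xy) - \min\{Z(x), Z(y)\}) / \max\{Z(x), Z(y)\}$. The choice of zlib as the compressor and the names `clen` and `ncd` are ours.

```python
import random
import zlib


def clen(data: bytes) -> int:
    """Compressed length in bytes, standing in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance under the zlib compressor."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    regular = b"abc" * 1000
    noisy = bytes(rng.getrandbits(8) for _ in range(3000))
    print(ncd(regular, regular), ncd(regular, noisy))
```

A real compressor only bounds $K$ from above, so the values are heuristic. They may slightly exceed 1 and need not be exactly 0 for identical inputs.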



III. Nonapproximability of the NID from Above

Lemma 3.1: There is no upper semicomputable function $g(x,y)$ such that $|e(x,y) - g(x,y)| < (\log n)/n$ for all strings $x, y$ of length $n$.

Proof: For simplicity we use $e(x,x) = 1/K(x)$.
By Theorem 3.8.1 in [12] (a result due to [6]), for every $n$ there is an $x$ of length $n$ such that

$$K(K(x) \mid x) > \log \frac{n}{\log n} + O(1). \qquad \text{(III.1)}$$

Assume there is an upper semicomputable function $g(x)$ such that

$$|e(x,x) - g(x)| < \frac{\log n}{n}$$

with $n = |x|$. That is, $|1/K(x) - g(x)| < (\log n)/n$. Then, $1/K(x)$ is upper semicomputable within
distance $(\log n)/n$. Therefore, $K(x)$ is lower semicomputable within distance $n/\log n$. Since $K(x)$ is
also upper semicomputable, it is computable within distance $n/\log n$. But this violates (III.1) because
we can describe the given distance in $\log(n/\log n)$ bits. (We can round $g(x)$ up or down and indicate
the direction of rounding with one bit.) ■

IV. Nonapproximability of the NID from Below

Let $x$ be a string of length $n$ and $t(n)$ a computable time bound. Then $K^t$ denotes the time-bounded
version of $K$ defined by

$$K^t(x) = \min_p \{ |p| : U(p) = x \text{ in at most } t(n) \text{ steps} \}.$$

The computation of $U$ is measured in terms of the output rather than the input, which is more natural in
the context of Kolmogorov complexity. Define the time-bounded version $E^t$ of $E$ by

$$E^t(x,y) = \max\{K^t(x \mid y), K^t(y \mid x)\}.$$



Lemma 4.1: For every length $n$ and computable time bound $t(n)$ there are strings $x$ and $a$ of length
$n$ such that

• $K(x) > n$,

• $K(x \mid a) > n$,

• $K(a \mid n) = O(1)$,

• $K^t(a \mid x) > n - O(1)$.

Proof: See [1], Lemma 7.7. ■

Lemma 4.2: For every length $n$ and large enough computable time bound $t(n)$, there exist strings $x$
and $y$ of length $n$ such that

• $K(x) > n$,

• $E(x,y) = O(1)$,

• $E^t(x,y) > n - O(1)$ (where the constant in the big-O depends on $t$ but not on $n$).

Proof: Let $x$ and $a$ be as in Lemma 4.1 using $2t(n)$ instead of $t(n)$. In this way, we have
$K^{2t}(a \mid x) > n - O(1)$. Define $y$ by $y = x \oplus a$ where $\oplus$ denotes the bitwise XOR. Then,

$$E(x,y) \le K(a \mid n) + O(1) = O(1).$$

We also have $a = x \oplus y$ so that (with the time bound $t$ large enough)

$$\begin{aligned}
n - O(1) &< K^{2t}(a \mid x) \\
&\le K^t(y \mid x) + O(1) \\
&\le \max\{K^t(x \mid y), K^t(y \mid x)\} + O(1) \\
&= E^t(x,y) + O(1).
\end{aligned}$$

■
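
The proof of Lemma 4.2 rests on the fact that, among $x$, $a$, and $y = x \oplus a$, any two determine the third. The following Python sketch illustrates only this XOR identity. The strings are arbitrary placeholders chosen by us and are not claimed to satisfy the complexity conditions of Lemma 4.1, which cannot be certified by computation.

```python
def xor_bits(u: str, v: str) -> str:
    """Bitwise XOR of two binary strings of equal length."""
    if len(u) != len(v):
        raise ValueError("strings must have the same length")
    return "".join("1" if a != b else "0" for a, b in zip(u, v))


x = "1011001110"  # placeholder for the incompressible string x
a = "0101010101"  # placeholder for the simple string a
y = xor_bits(x, a)

# Given one of x, y, the string a converts it into the other,
# and a itself is recovered from the pair (x, y).
assert xor_bits(y, a) == x
assert xor_bits(x, y) == a
```

This is why a short description of $a$ suffices to go from $x$ to $y$ and back, giving $E(x,y) = O(1)$, while any fast program for $y$ given $x$ would yield a fast program for $a$ given $x$.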

Lemma 4.3: There is no lower semicomputable function $g(x,y)$ such that $|e(x,y) - g(x,y)| < 1/2$ for
all strings $x, y$.

Proof: Assume by way of contradiction that the lemma is false, and consider strings $x, y$ of length $n$
satisfying the conditions in Lemma 4.2. Let $g(x,y) + \delta = e(x,y)$ for some $\delta$ with $-\frac{1}{2} < \delta < \frac{1}{2}$.
Let $g_i$ be a lower semicomputable function approximation of $g$ such that $g_{i+1}(x,y) \ge g_i(x,y)$ for all $i$
and $\lim_{i \to \infty} g_i(x,y) = g(x,y)$. Let $E_i$ be an upper semicomputable function approximating $E$ such that
$E_{i+1}(x,y) \le E_i(x,y)$ for all $i$ and $\lim_{i \to \infty} E_i(x,y) = E(x,y)$. Finally, let $s = s(x,y)$ be the least $i$
such that

$$g_s(x,y) + \delta > \frac{E_s(x,y)}{n + 2\log n + O(1)}.$$

Here "log" denotes the binary logarithm. Since $K(z) \le n + 2\log n + O(1)$ for every string $z$ of length
$n$, see [12], and $\lim_{i \to \infty} g_i(x,y) + \delta = e(x,y)$, such an $s$ exists by the contradictory assumption.
Claim 4.4: There is a constant $c$ depending on $s$ but not on $x$ and $y$ such that $E_s(x,y) \ge n - c$.
Proof: Define a computable time bound $t(n)$ such that

$$E_s(u,v) < n - c \;\Longrightarrow\; E^t(u,v) < n - c \qquad \text{(IV.1)}$$

for all strings $u, v$ of length $n$, and $c$ a constant to be determined below, as follows. If $E_s(u,v) < n - c$,
then $E(u,v) < n - c$. There is a program witnessing $E_s(u,v) < n - c$, and we can define $t(u,v)$ to be
the running time of this program. Let $t(n)$ be the maximum of the $t(u,v)$'s for all pairs $u, v$ such that
$E_s(u,v) < n - c$ with $s$ as in (IV.1). By Lemma 4.2 we have $E^t(x,y) > n - c'$ for some constant $c'$
depending on $t$ but not on $n$ (and $x$ and $y$). Set $c = c'$. Then, by (IV.1), we have $E_s(x,y) \ge n - c$. ■
Recall that $e(x,y) = g(x,y) + \delta$. Consider the sequence of inequalities

$$\begin{aligned}
E(x,y) - \delta n &\ge g(x,y) \cdot n \\
&\ge g_s(x,y) \cdot n \\
&\ge \frac{E_s(x,y) \cdot n}{n + 2\log n + O(1)} - \delta n \\
&\ge \Omega(n) - \delta n,
\end{aligned}$$

for $n$ large enough. The first inequality holds since $K(x) > n$. The second holds since $g_s(x,y) \le g(x,y)$,
and the third by the choice of $s$. The last inequality holds since $E_s(x,y) \ge n - c$ by Claim 4.4 and
$n/(n + 2\log n + O(1)) > 1/2$ for $n$ large enough. Since we have shown that
$E(x,y) = \Omega(n)$, we contradict $E(x,y) = O(1)$ by Lemma 4.2. ■

V. Open Problem

A subset of $\mathcal{N}$ is called n-computably enumerable (n-c.e.) if it is a Boolean combination of n
computably enumerable sets. Thus, the 1-c.e. sets are the computably enumerable sets, the 2-c.e. sets (also
called d.c.e.) the differences of two c.e. sets, and so on. The n-c.e. sets are referred to as the difference
hierarchy over the c.e. sets. This is an effective analog of a classical hierarchy from descriptive set theory.
Note that a set is n-c.e. if it has a computable approximation that changes at most n times.

We can extend the notion of n-c.e. set to a notion that measures the number of fluctuations of a function
as follows: For every $n \ge 1$, call $f : \mathcal{N} \to \mathcal{R}$ n-approximable if there is a rational-valued computable
approximation $\phi$ such that $\lim_{k \to \infty} \phi(x, k) = f(x)$ and such that for every $x$, the number of $k$'s such that
$\phi(x, k+1) - \phi(x, k) < 0$ is bounded by $n - 1$. That is, $n - 1$ is a bound on the number of fluctuations of
the approximation. Note that the 1-approximable functions are precisely the lower semicomputable ($\Sigma_1^0$)
ones (zero fluctuations). Also note that a set $A \subseteq \mathcal{N}$ is n-c.e. if and only if the characteristic function
of $A$ is n-approximable.
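
The notion of fluctuation can be made concrete on a finite prefix of an approximation sequence. The Python sketch below uses names of our own choosing (`count_fluctuations`, `consistent_with_n_approximable`). Because it inspects only finitely many stages, it can refute a proposed bound $n - 1$ for a given approximation but cannot certify the bound in the limit.

```python
from fractions import Fraction
from typing import Sequence


def count_fluctuations(stages: Sequence[Fraction]) -> int:
    """Number of k with stages[k + 1] - stages[k] < 0 (downward steps)."""
    return sum(1 for prev, nxt in zip(stages, stages[1:]) if nxt - prev < 0)


def consistent_with_n_approximable(stages: Sequence[Fraction], n: int) -> bool:
    """True if the observed prefix has at most n - 1 fluctuations."""
    return count_fluctuations(stages) <= n - 1


# A nondecreasing prefix has zero fluctuations (the lower semicomputable case).
monotone = [Fraction(0), Fraction(1, 2), Fraction(3, 4)]
# A 0 -> 1 -> 0 prefix, as for the characteristic function of a d.c.e. set.
dce_like = [Fraction(0), Fraction(1), Fraction(0)]
```

Here `monotone` is consistent with 1-approximability, while `dce_like` already rules out $n = 1$ for that particular approximation and is consistent with $n = 2$.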

Conjecture: For every $n \ge 1$, the normalized information distance $e$ is not n-approximable.